| Variable Name | Explanation |
|---|---|
| ID | Number corresponding to the precise combination of the features of the model |
| Model Year | Year of the model of the car |
| Make | The brand of the car |
| Model | The model of the car |
| Estimated Annual Petroleum Consumption (Barrels) | Consumption in Petroleum Barrels |
| Fuel Type 1 | First fuel energy source; the only source if the car is not a hybrid |
| City MPG (Fuel Type 1) | |
| Highway MPG (Fuel Type 1) | |
| Combined MPG (Fuel Type 1) | |
| Fuel Type 2 | Second energy source if hybrid car |
| City MPG (Fuel Type 2) | |
| Highway MPG (Fuel Type 2) | |
| Combined MPG (Fuel Type 2) | |
| Engine Cylinders | From 2 to 16 cylinders |
| Engine Displacement | Measure of the cylinder volume swept by all of the pistons of a piston engine, excluding the combustion chambers |
| Drive | Drivetrain of the car, e.g. Front-Wheel Drive, 4-Wheel Drive, ... |
| Engine Description | Description of the engine, e.g. Turbo, Stop-Start, ... |
| Transmission | Manual/Automatic transmission, with number of gears and/or model of transmission |
| Vehicle Class | e.g. Minivan, Trucks, Midsize, ... |
| Time to Charge EV (hours at 120v) | |
| Time to Charge EV (hours at 240v) | |
| Range (for EV) | |
| City Range (for EV - Fuel Type 1) | |
| City Range (for EV - Fuel Type 2) | |
| Hwy Range (for EV - Fuel Type 1) | |
| Hwy Range (for EV - Fuel Type 2) | |
1 Introduction
• The context and background: course, company name, business context.
During the first year of our Master in Management, orientation Business Analytics, we had the opportunity to attend the lectures of Machine Learning for Business Analytics. In this class, we covered multiple machine learning techniques for business contexts, mainly supervised methods (regressions, trees, support vector machines, neural networks) and unsupervised methods (clustering, PCA, FAMD, auto-encoders), but also other topics such as data splitting, ensemble methods and evaluation metrics.
• Aim of the investigation: major terms should be defined, the question of research (more generally the issue), why it is of interest and relevant in that context.
In the context of this class, our group had the opportunity to work on an applied project. Starting from scratch, we had to look for a dataset on which to apply to a real case what we had learned in class. We found an interesting dataset covering vehicle MPG, range, engine statistics and more, for more than 100 brands. The goal of our research was to predict the make (i.e. the brand) of a car from its characteristics (consumption, range, fuel type, ...) using a model trained for this task (random forests, neural networks or decision trees). Since some cars share several identical characteristics but differ on various others, we thought it would be pertinent to build a model able to predict a car's brand from its features.
• Description of the data and the general material provided and how it was made available (and/or collected, if it is relevant). Only in broad terms however, the data will be further described in a following section. Typically, the origin/source of the data (the company, webpage, etc.), the type of files (Excel files, etc.), and what it contains in broad terms (e.g. “a file containing weekly sales with the factors of interest including in particular the promotion characteristics”).
The dataset, in csv format, was found on data.world, a data catalog platform that gathers various open-access datasets online. The file contains more than 45,000 rows and 26 columns, each column corresponding to one feature (such as the model year, the model, the estimated annual petroleum consumption in barrels, the highway MPG per fuel type, and so on).
• The method that is used, in broad terms, no details needed at this point. E.g. “Model based machine learning will help us quantifying the important factors on the sales”.
Based on these columns, we had to find a machine learning model that could help us quantify the importance of the features in predicting the make of the car. Various models will be tried, for both supervised and unsupervised learning.
• An outlook: a short paragraph indicating from now what will be treated in each following sections/chapters. E.g. “in Section 3, we describe the data. Section 4 is dedicated to the presentation of the text mining methods…”
From now on, we will go through the different sections. Section 2 is dedicated to the data description in more depth, mentioning the variables and features, the instances, the types of data and eventually some missing-data patterns. Then, Section 3 covers the Exploratory Data Analysis (EDA), where some visualizations are made in order to better perceive patterns in the variables as well as potential correlations. After that, Section 4 presents the methods, first supervised and then unsupervised, in order to find a suitable model for our project. The results are discussed right after, and we proceed with a conclusion, as well as recommendations and discussions. Finally, the references and the appendix are visible at the end of the report.
2 Data description
- Description of the data file format (xlsx, csv, text, video, etc.) DONE
- The features or variables: type, units, the range (e.g. the time, numerical, in weeks from January 1, 2012 to December 31, 2015), their coding (numerical, the levels for categorical, etc.), etc. TABLE-NTBF
- The instances: customers, company, products, subjects, etc. DONE
- Missing data pattern: if there are missing data, if they are specific to some features, etc. NTBD
- Any modification to the initial data: aggregation, imputation in replacement of missing data, recoding of levels, etc. NTBD
- If only a subset was used, it should be mentioned and explained; e.g. inclusion criteria. Note that if inclusion criteria do not exist and the inclusion was an arbitrary choice, it should be stated as such. One should not try to invent unreal justifications. NTBD
For this project, we selected a dataset focused on vehicle characteristics, available as a .csv file from data.world. It includes a total of 26 features describing 45,896 vehicle models released between 1984 and 2023. Below is a table providing an overview of the available features and their descriptions. A deeper description of the data can be found in ?@sec-Annex.
2.0.1 The features or variables: type, units,…
2.1 The instances: customers, company, products, subjects, etc.
Each row corresponds to one car. In order, we find the ID of the car, identifying a precise combination of features, followed by the features as presented in the table above.
2.2 Missing data pattern: if there are missing data, if they are specific to some features, etc.
2.3 Any modification to the initial data: aggregation, imputation in replacement of missing data, recoding of levels, etc.
2.4 If only a subset was used, it should be mentioned and explained; e.g. inclusion criteria. Note that if inclusion criteria do not exist and the inclusion was an arbitrary choice, it should be stated as such. One should not try to invent unreal justifications.
EDA:
Columns description
To begin with our EDA, let’s have a look at our dataset and in particular the characteristics of the columns.
Show the code
#to get a detailed summary
skim(data)
| Name | data |
| Number of rows | 45896 |
| Number of columns | 26 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 18 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| make | 0 | 1.00 | 3 | 34 | 0 | 141 | 0 |
| model | 0 | 1.00 | 1 | 47 | 0 | 4762 | 0 |
| fuel_type_1 | 0 | 1.00 | 6 | 17 | 0 | 6 | 0 |
| fuel_type_2 | 44059 | 0.04 | 3 | 11 | 0 | 4 | 0 |
| drive | 1186 | 0.97 | 13 | 26 | 0 | 7 | 0 |
| engine_Description | 17031 | 0.63 | 1 | 46 | 0 | 589 | 0 |
| transmission | 11 | 1.00 | 12 | 32 | 0 | 40 | 0 |
| vehicle_class | 0 | 1.00 | 4 | 34 | 0 | 34 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1.00 | 23102.11 | 13403.10 | 1.00 | 11474.75 | 23090.50 | 34751.25 | 46332.00 | ▇▇▇▇▇ |
| model_year | 0 | 1.00 | 2003.61 | 12.19 | 1984.00 | 1992.00 | 2005.00 | 2015.00 | 2023.00 | ▇▆▆▇▇ |
| estimated_Annual_Petrolum_Consumption_Barrels | 0 | 1.00 | 15.33 | 4.34 | 0.05 | 12.94 | 14.88 | 17.50 | 42.50 | ▁▇▃▁▁ |
| City_MPG_Fuel_Type_1 | 0 | 1.00 | 19.11 | 10.31 | 6.00 | 15.00 | 17.00 | 21.00 | 150.00 | ▇▁▁▁▁ |
| highway_mpg_fuel_type_1 | 0 | 1.00 | 25.16 | 9.40 | 9.00 | 20.00 | 24.00 | 28.00 | 140.00 | ▇▁▁▁▁ |
| combined_MPG_Fuel_Type_1 | 0 | 1.00 | 21.33 | 9.78 | 7.00 | 17.00 | 20.00 | 23.00 | 142.00 | ▇▁▁▁▁ |
| City_MPG_Fuel_Type_2 | 0 | 1.00 | 0.85 | 6.47 | 0.00 | 0.00 | 0.00 | 0.00 | 145.00 | ▇▁▁▁▁ |
| highway_mpg_fuel_type_2 | 0 | 1.00 | 1.00 | 6.55 | 0.00 | 0.00 | 0.00 | 0.00 | 121.00 | ▇▁▁▁▁ |
| combined_MPG_Fuel_Type_2 | 0 | 1.00 | 0.90 | 6.43 | 0.00 | 0.00 | 0.00 | 0.00 | 133.00 | ▇▁▁▁▁ |
| engine_cylinders | 487 | 0.99 | 5.71 | 1.77 | 2.00 | 4.00 | 6.00 | 6.00 | 16.00 | ▇▇▅▁▁ |
| engine_displacement | 485 | 0.99 | 3.28 | 1.36 | 0.00 | 2.20 | 3.00 | 4.20 | 8.40 | ▁▇▅▂▁ |
| time_to_Charge_EV_hours_at_120v_ | 0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| charge_time_240v | 0 | 1.00 | 0.11 | 1.01 | 0.00 | 0.00 | 0.00 | 0.00 | 15.30 | ▇▁▁▁▁ |
| range_for_EV | 0 | 1.00 | 2.36 | 24.97 | 0.00 | 0.00 | 0.00 | 0.00 | 520.00 | ▇▁▁▁▁ |
| range_ev_city_fuel_type_1 | 0 | 1.00 | 1.62 | 20.89 | 0.00 | 0.00 | 0.00 | 0.00 | 520.80 | ▇▁▁▁▁ |
| range_ev_city_fuel_type_2 | 0 | 1.00 | 0.17 | 2.73 | 0.00 | 0.00 | 0.00 | 0.00 | 135.28 | ▇▁▁▁▁ |
| range_ev_highway_fuel_type_1 | 0 | 1.00 | 1.51 | 19.70 | 0.00 | 0.00 | 0.00 | 0.00 | 520.50 | ▇▁▁▁▁ |
| range_ev_highway_fuel_type_2 | 0 | 1.00 | 0.16 | 2.46 | 0.00 | 0.00 | 0.00 | 0.00 | 114.76 | ▇▁▁▁▁ |
The dataset that we are working with contains approximately 46,000 rows and 26 columns. We can see that most of our features concern the consumption of the cars. In addition, we notice that some variables contain a lot of missing values and that the variable “Time.to.Charge.EV..hours.at.120v.” contains only 0s. We will handle these in the “Data cleaning” section.
Exploration of the distribution
Here are more details about the distribution of the numerical features.
Show the code
# melt.data <- melt(data)
#
# ggplot(data = melt.data, aes(x = value)) +
# stat_density() +
# facet_wrap(~variable, scales = "free")
plot_histogram(data) # Time.to.Charge.EV..hours.at.120v. not appearing because all observations = 0
We notice that most of the observations of our features are 0s because of the nature of the features. For instance, as most of our cars are not hybrids, they have a unique fuel type and no type 2, which results in a 0 in the concerned features. Also, some features are numerical but discrete, as we can see on the plot of the column “Engine Cylinders”.
Outliers Detection
For each of our numerical columns, let's check the outliers per feature thanks to boxplots.
Show the code
#tentative boxplots
data_long <- data %>%
select_if(is.numeric) %>%
pivot_longer(cols = c("ID",
"model_year",
"estimated_Annual_Petrolum_Consumption_Barrels", "City_MPG_Fuel_Type_1",
"highway_mpg_fuel_type_1",
"combined_MPG_Fuel_Type_1",
"City_MPG_Fuel_Type_2",
"highway_mpg_fuel_type_2",
"combined_MPG_Fuel_Type_2",
"time_to_Charge_EV_hours_at_120v_",
"charge_time_240v",
"range_for_EV",
"range_ev_city_fuel_type_1",
"range_ev_city_fuel_type_2",
"range_ev_highway_fuel_type_1",
"range_ev_highway_fuel_type_2"), names_to = "variable", values_to = "value")
ggplot(data_long, aes(x = variable, y = value, fill = variable)) +
geom_boxplot(outlier.size = 0.5) + # Make outlier points smaller
facet_wrap(~ variable, scales = "free_y") + # Each variable gets its own y-axis
theme_minimal() +
theme(legend.position = "none", # Hide the legend
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 0),strip.text = element_text(size = 7)) + # Rotate x-axis labels
labs(title = "Boxplots of Variables with Different Scales", x = "", y = "Value")
Show the code
#Now
# plot_correlation(data) #drop time charge EV 120V
# create_report(data)
#nb cars per brandnumber of models per make
Now let’s check how many models per make we have in our dataset. In order to have a clear plot, we have decided to keep the top 20 brands among all the make on the graph. All the remaining makes are accessible on the table just below.
Show the code
#Number of occurences/model per make
nb_model_per_make <- data %>%
group_by(make, model) %>%
summarise(Number = n(), .groups = 'drop') %>%
group_by(make) %>%
summarise(Models_Per_Make = n(), .groups = 'drop') %>%
arrange(desc(Models_Per_Make))
#table
datatable(nb_model_per_make,
rownames = FALSE,
options = list(pageLength = 10,
class = "hover",
searchHighlight = TRUE))
Show the code
# Option to limit to top 20 makes for better readability
top_n_makes <- nb_model_per_make %>% top_n(20, Models_Per_Make)
# Reordering the Make variable within the plotting code to make it ordered by Models_Per_Make descending
# nb_model_per_make$Make <- factor(nb_model_per_make$Make, levels = nb_model_per_make$Make[order(-nb_model_per_make$Models_Per_Make)])
Show the code
ggplot(top_n_makes, aes(x = reorder(make, Models_Per_Make), y = Models_Per_Make)) +
geom_bar(stat = "identity", color = "black", fill = "grey", show.legend = FALSE) +
labs(title = "Models per Make (Top 20)",
x = "Make",
y = "Number of Models") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(hjust = 1, size = 10),
plot.title = element_text(size = 14)) +
coord_flip() # Flip coordinates for better readability
We can see that Mercedes-Benz and BMW have significantly more models in our dataset, which means that we are dealing with imbalanced categories. We therefore need to be careful when making predictions, as we may encounter a bias toward these two majority classes. A few techniques can be used to deal with this problem, such as resampling techniques, ensemble methods (random forests, boosting) or tuning the probability threshold.
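To make the imbalance concrete, one of the remedies mentioned above can be sketched as inverse-frequency class weights derived from the per-make counts. The sketch below uses Python on a tiny toy stand-in for the make column (the brand names and counts are illustrative, not taken from our data); such weights can be passed, for example, to the class_weight parameter of scikit-learn classifiers.

```python
import pandas as pd

# Toy stand-in for the real 'make' column (the actual data has 141 brands).
data = pd.DataFrame({"make": ["BMW"] * 6 + ["Audi"] * 3 + ["Saab"] * 1})

# Rows per make: a large spread confirms the class imbalance.
counts = data["make"].value_counts()

# Inverse-frequency weights: rare makes get larger weights, which
# counteracts the bias toward the majority classes during training.
weights = len(data) / (counts.size * counts)
print(weights.to_dict())
```
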
Correlation matrix for numerical features
Show the code
#Here I encounter some problems with some of the variables.
#check NAs
# colSums(is.na(data))
data_corrplot <- data %>%
select_if(is.numeric)
# Identify constant columns (columns with zero standard deviation)
constant_columns <- sapply(data_corrplot, function(x) sd(x, na.rm = TRUE) == 0)
# Print constant columns for inspection
print("Constant columns (standard deviation is zero):")
[1] "Constant columns (standard deviation is zero):"
Show the code
print(names(data_corrplot)[constant_columns])
[1] "time_to_Charge_EV_hours_at_120v_"
Show the code
# Remove constant columns
data_corrplot <- data_corrplot[, !constant_columns]
# Correlation transformation for plot using complete observations
cor_matrix <- cor(data_corrplot, use = "complete.obs")
Warning in cor(data_corrplot, use = "complete.obs"): the standard deviation is zero
Show the code
# Melt correlation matrix for plotting
cor_melted <- melt(cor_matrix)
# Plot correlation matrix heatmap using ggplot2
ggplot(data = cor_melted, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 8, hjust = 1),
axis.text.y = element_text(size = 8)) +
coord_fixed() +
labs(x = '', y = '', title = 'Correlation Matrix Heatmap')
3 Data cleaning
In this section we handle the missing values of our dataset to make sure that we have a clean dataset on which to perform our EDA and modeling. We first visualize the missing values and then clean them in the columns that we will use for our analysis. We also remove some rows and columns that are not relevant for our analysis.
Let’s have a look at the entire dataset and its missing values in grey.
We can see that overall, we do not have many missing values in proportion to the size of our dataset. However, some columns have a lot of missing values. Let's have a look at the columns and rows with missing values in more detail.
We can now more easily see the missing values in our data. Below is the detail of the percentage of missing values by column.
Let’s first have a closer look at the engine cylinders and engine displacement columns.
We see that all the {r} miss_elec missing values in “Engine Cylinders” and “Engine Displacement” correspond to vehicles whose fuel type is “{r} fuel_type_1_miss”. Therefore, we can conclude that the missing values in “Engine Cylinders” and “Engine Displacement” represent our electric vehicles. This makes sense since electric vehicles do not have a combustion engine, so these categories are not really applicable to them. We will therefore replace all missing values in these two columns with “none”.
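This replacement can be sketched in pandas as follows; the cleaning in this report is actually done in R, and the toy frame below (column names and values included) merely stands in for the real data to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Toy stand-in: electric rows have missing engine stats, as in our data.
df = pd.DataFrame({
    "fuel_type_1": ["Regular Gasoline", "Electricity", "Electricity"],
    "engine_cylinders": [4.0, np.nan, np.nan],
    "engine_displacement": [2.0, np.nan, np.nan],
})

# Cast to object so the "none" placeholder can coexist with numeric values,
# then replace the EV-related missing values.
for col in ["engine_cylinders", "engine_displacement"]:
    df[col] = df[col].astype(object).fillna("none")
```
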
Show the code
# Create a summary dataframe of missing values by column
missing_summary_df2 <- data_cleaning %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Column", values_to = "Missing_Count") %>%
mutate(
Total_Rows = nrow(data),
Proportion_Missing = Missing_Count / Total_Rows
) %>%
arrange(desc(Proportion_Missing)) %>%
select(Column, "Missing values" = Missing_Count, "Prop. Missing" = Proportion_Missing)
# Print the summary dataframe
datatable(missing_summary_df2,
options = list(pageLength = 6,
class = "hover",
searchHighlight = TRUE),
rownames = FALSE)%>%
formatPercentage("Prop. Missing", 2)
Show the code
# Count the missing 'Drive' values per brand
missing_drive_by_make <- data_cleaning %>%
filter(is.na(Drive)) %>%
count(Make)
# Get total counts per brand in the entire dataset
total_counts_by_make <- data_cleaning %>%
count(Make)
# Calculate the percentage of missing 'Drive' values per brand
percentage_missing_drive_by_make <- missing_drive_by_make %>%
left_join(total_counts_by_make, by = "Make", suffix = c(".missing", ".total")) %>%
mutate(PercentageMissing = (n.missing / n.total)) %>%
arrange(desc(PercentageMissing))
# Print the summary dataframe
datatable(percentage_missing_drive_by_make,
options = list(pageLength = 6,
class = "hover",
searchHighlight = TRUE),
rownames = FALSE)%>%
formatPercentage("PercentageMissing", 2)
Show the code
# Calculate the percentage of missing 'Drive' values per brand
brand_summary <- data_cleaning %>%
group_by(Make) %>%
summarise(Total = n(),
Missing = sum(is.na(Drive)),
PercentageMissing = (Missing / Total))
# Identify brands with more than 10% missing 'Drive' data
brands_to_remove <- brand_summary %>%
filter(PercentageMissing > brand_missing_threshold) %>%
pull(Make)
# Filter out these brands from the dataset
data_filtered <- data_cleaning %>%
filter(!(Make %in% brands_to_remove))
# For the remaining data, drop rows with missing 'Drive' values
data_cleaning2 <- data_filtered %>%
filter(!is.na(Drive))
Show the code
# Create a summary dataframe of missing values by column
missing_summary_df3 <- data_cleaning2 %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Column", values_to = "Missing_Count") %>%
mutate(
Total_Rows = nrow(data),
Proportion_Missing = Missing_Count / Total_Rows
) %>%
arrange(desc(Proportion_Missing)) %>%
select(Column, "Missing values" = Missing_Count, "Prop. Missing" = Proportion_Missing)
# Print the summary dataframe
datatable(missing_summary_df3,
options = list(pageLength = 6,
class = "hover",
searchHighlight = TRUE),
rownames = FALSE)%>%
formatPercentage("Prop. Missing", 2)
Show the code
# Remove rows where the 'Transmission' column has missing values
data_cleaning3 <- data_cleaning2 %>%
filter(!is.na(Transmission))
data_cleaning4 <- data_cleaning3 %>%
mutate(Fuel.Type.2 = replace_na(Fuel.Type.2, "none"))
Show the code
# Create a summary dataframe of missing values by column
missing_summary_df3 <- data_cleaning3 %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Column", values_to = "Missing_Count") %>%
mutate(
Total_Rows = nrow(data_cleaning3),
Proportion_Missing = Missing_Count / Total_Rows
) %>%
arrange(desc(Proportion_Missing)) %>%
select(Column, "Missing values" = Missing_Count, "Prop. Missing" = Proportion_Missing)
# Print the summary dataframe
datatable(missing_summary_df3,
options = list(pageLength = 3,
class = "hover",
searchHighlight = TRUE),
rownames = FALSE)%>%
formatPercentage("Prop. Missing", 2)
EDA (after cleaning):
Columns description
To begin with our EDA, let’s have a look at our dataset and in particular the characteristics of the columns.
Show the code
#to get a detailed summary
skim(data)
| Name | data |
| Number of rows | 42240 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| make | 0 | 1 | 3 | 34 | 0 | 129 | 0 |
| vehicle_class | 0 | 1 | 4 | 34 | 0 | 34 | 0 |
| drive | 0 | 1 | 13 | 26 | 0 | 7 | 0 |
| engine_cylinders | 3 | 1 | 1 | 4 | 0 | 10 | 0 |
| engine_displacement | 2 | 1 | 1 | 4 | 0 | 67 | 0 |
| transmission | 0 | 1 | 12 | 32 | 0 | 39 | 0 |
| fuel_type_1 | 0 | 1 | 6 | 17 | 0 | 6 | 0 |
| fuel_type_2 | 0 | 1 | 3 | 11 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| model_year | 0 | 1 | 2004.62 | 11.86 | 1984 | 1994 | 2006 | 2015 | 2023.00 | ▆▆▆▇▇ |
| city_mpg_fuel_type_1 | 0 | 1 | 19.11 | 10.63 | 6 | 15 | 17 | 21 | 150.00 | ▇▁▁▁▁ |
| highway_mpg_fuel_type_1 | 0 | 1 | 25.06 | 9.66 | 9 | 20 | 24 | 28 | 140.00 | ▇▁▁▁▁ |
| city_mpg_fuel_type_2 | 0 | 1 | 0.90 | 6.72 | 0 | 0 | 0 | 0 | 145.00 | ▇▁▁▁▁ |
| highway_mpg_fuel_type_2 | 0 | 1 | 1.06 | 6.79 | 0 | 0 | 0 | 0 | 121.00 | ▇▁▁▁▁ |
| range_ev_city_fuel_type_1 | 0 | 1 | 1.76 | 21.77 | 0 | 0 | 0 | 0 | 520.80 | ▇▁▁▁▁ |
| range_ev_highway_fuel_type_1 | 0 | 1 | 1.64 | 20.52 | 0 | 0 | 0 | 0 | 520.50 | ▇▁▁▁▁ |
| range_ev_city_fuel_type_2 | 0 | 1 | 0.19 | 2.85 | 0 | 0 | 0 | 0 | 135.28 | ▇▁▁▁▁ |
| range_ev_highway_fuel_type_2 | 0 | 1 | 0.18 | 2.57 | 0 | 0 | 0 | 0 | 114.76 | ▇▁▁▁▁ |
| charge_time_240v | 0 | 1 | 0.12 | 1.06 | 0 | 0 | 0 | 0 | 15.30 | ▇▁▁▁▁ |
After cleaning, the dataset contains 42,240 rows and 18 columns. Most of our features still concern the consumption of the cars, but the variables with many missing values and the constant column “Time.to.Charge.EV..hours.at.120v.” have now been handled.
Exploration of the distribution
Here are more details about the distribution of the numerical features.
Show the code
# melt.data <- melt(data)
#
# ggplot(data = melt.data, aes(x = value)) +
# stat_density() +
# facet_wrap(~variable, scales = "free")
plot_histogram(data)
We notice that most of the observations of our features are 0s because of the nature of the features. For instance, as most of our cars are not hybrids, they have a unique fuel type and no type 2, which results in a 0 in the concerned features. Also, some features are numerical but discrete, as we can see on the plot of the column “Engine Cylinders”.
Outliers Detection
For each of our numerical columns, let's check the outliers per feature thanks to boxplots.
Show the code
#tentative boxplots
data_long <- data %>%
select_if(is.numeric) %>%
pivot_longer(cols = c("model_year", "city_mpg_fuel_type_1", "highway_mpg_fuel_type_1", "city_mpg_fuel_type_2", "highway_mpg_fuel_type_2", "range_ev_city_fuel_type_1", "range_ev_highway_fuel_type_1", "range_ev_city_fuel_type_2", "range_ev_highway_fuel_type_2", "charge_time_240v"), names_to = "variable", values_to = "value")
ggplot(data_long, aes(x = variable, y = value, fill = variable)) +
geom_boxplot(outlier.size = 0.5) + # Make outlier points smaller
facet_wrap(~ variable, scales = "free_y") + # Each variable gets its own y-axis
theme_minimal() +
theme(legend.position = "none", # Hide the legend
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 0),strip.text = element_text(size = 7)) + # Rotate x-axis labels
labs(title = "Boxplots of Variables with Different Scales", x = "", y = "Value")
Show the code
#Now
# plot_correlation(data) #drop time charge EV 120V
# create_report(data)
#nb cars per brandnumber of models per make
As we got rid of XXX, …..
Correlation matrix for numerical features
Show the code
#Here I encounter some problems with some of the variables.
#check NAs
# colSums(is.na(data))
data_corrplot <- data %>%
select_if(is.numeric)
# Identify constant columns (columns with zero standard deviation)
constant_columns <- sapply(data_corrplot, function(x) sd(x, na.rm = TRUE) == 0)
# Print constant columns for inspection
print("Constant columns (standard deviation is zero):")
[1] "Constant columns (standard deviation is zero):"
Show the code
print(names(data_corrplot)[constant_columns])
character(0)
Show the code
# Remove constant columns
data_corrplot <- data_corrplot[, !constant_columns]
# Correlation transformation for plot using complete observations
cor_matrix <- cor(data_corrplot, use = "complete.obs")
# Melt correlation matrix for plotting
cor_melted <- melt(cor_matrix)
# Plot correlation matrix heatmap using ggplot2
ggplot(data = cor_melted, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 8, hjust = 1),
axis.text.y = element_text(size = 8)) +
coord_fixed() +
labs(x = '', y = '', title = 'Correlation Matrix Heatmap')
4 Classification Tree
In this section we perform a classification tree analysis on the dataset. We first load the necessary packages and the dataset. We then prepare the data by encoding categorical variables and splitting it into training and testing sets. Finally, we prune the tree with different max_depth values to find the tree depth that best balances training and test accuracy.
We first loaded the dataset and identified make as the target variable. We also encoded categorical variables using Label Encoding to convert them into numerical values.
We then split the dataset into training (80%) and testing (20%) sets to be able to evaluate the model's performance on unseen data after training and to check whether the model is overfitting or not. We will see that it does.
We trained a Decision Tree classifier on the training data without any constraints. The “None” case below represents the case without pruning. As we can see, we observed overfitting, with high accuracy on the training data and notably lower accuracy on the test data. Therefore, we decided to prune the tree, as pruning has the advantage of simplifying the model and thereby limiting overfitting. We pruned the tree by trying a few max_depth values to control the tree's growth (5, 10, 15, 20, 25, 30 and None). The goal is to find the tree depth that best balances training and test accuracy.
| max_depth | Training Accuracy | Test Accuracy |
|---|---|---|
| 5 | 0.2605 | 0.2550 |
| 10 | 0.4887 | 0.4674 |
| 15 | 0.7203 | 0.6352 |
| 20 | 0.8520 | 0.6926 |
| 25 | 0.8898 | 0.6972 |
| 30 | 0.8939 | 0.6971 |
| None | 0.8939 | 0.6979 |
The model's accuracy improved as the tree's depth increased up to a point, with a max_depth of 25 or 30 providing test accuracy close to 70%. Reducing the max_depth to 10 or 15 narrows the gap between training and test accuracy, thus drastically reducing overfitting, but at the expense of accuracy on new data. Pruning the tree with a max depth of 25 keeps the test accuracy at about 70%, close to the unpruned tree, while reducing the gap between the training and the test set. In our case, pruning the Decision Tree therefore helps improve generalization by preventing the tree from becoming too complex and overfitting the training data.
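The depth sweep described above can be reproduced with a loop like the one below. Since the project's data file is not bundled here, synthetic data from make_classification stands in for the encoded vehicle features, so the printed numbers will differ from the table; this is a sketch of the procedure, not the exact run.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the label-encoded vehicle data.
X, y = make_classification(n_samples=2000, n_features=15, n_informative=10,
                           n_classes=5, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# Sweep max_depth, comparing training and test accuracy to spot overfitting.
for depth in [5, 10, 15, 20, 25, 30, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_train, y_train)
    print(depth,
          round(tree.score(X_train, y_train), 4),  # training accuracy
          round(tree.score(X_test, y_test), 4))    # test accuracy
```

The unconstrained tree fits the training data almost perfectly, while shallow depths trade training accuracy for a smaller train/test gap, mirroring the pattern in the table.
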
5 Neural Network
In this section, we will build a neural network model to predict the make of a car based on the features at our disposal. We will preprocess the data, split it into training and testing sets, define the neural network architecture, compile the model, train it and evaluate its performance.
5.1 Preprocessing and splitting the data
The dataset contains different types of data. Some columns are numerical (like “city_mpg_fuel_type_1” or “charge_time_240v”), and some are categorical (“vehicle_class” or “fuel_type”). We identify and separate these two types of columns, an essential preprocessing step because they require different handling to be ready for machine learning algorithms. The numerical columns need to be scaled so that they have a mean of zero and a standard deviation of one, which helps the machine learning algorithm perform better, while the categorical columns need to be one-hot encoded, which creates binary columns in a format that the machine learning model can understand.
Show the code
# Load the data
data = pd.read_csv(here("data/data_cleaned.csv"))
# Identify categorical and numerical columns
categorical_cols = data.select_dtypes(include=['object']).columns.tolist()
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
# Remove the target column 'make' from the features list
if 'make' in categorical_cols:
categorical_cols.remove('make')
if 'make' in numerical_cols:
numerical_cols.remove('make')
# Define the preprocessing steps for numerical and categorical columns
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numerical_cols),
('cat', OneHotEncoder(sparse_output=False), categorical_cols) # Set sparse_output to False
])
Show the code
# Split data into features and target
X = data.drop('make', axis=1)
y = data['make']
# Apply preprocessing
X_preprocessed = preprocessor.fit_transform(X)
# Encode the target variable
y_encoded = pd.get_dummies(y).values
The data is split into two parts: training and testing. The training set is used to train the model, and the testing set is used to evaluate its performance. This split ensures that we can test how well the model generalizes to new, unseen data.
Show the code
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y_encoded, test_size=0.2, random_state=123)
5.1.1 Building the neural network model and training it
We chose to use a neural network. It consists of layers of neurons, where each layer applies transformations to the data. The first layer takes the input features. Then, hidden layers help the model learn complex patterns. We also added dropout layers, which help mitigate overfitting by randomly “dropping out” (ignoring) a fraction of the neurons during each training step. Finally, the output layer predicts the probability of each car manufacturer.
Show the code
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

# Define the neural network model
model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(y_train.shape[1], activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
The model is trained using the training data. During training, the model learns by adjusting its internal parameters to minimize the difference between its predictions and the actual car manufacturers in the training data. The model is trained for a fixed number of iterations called epochs.
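For intuition, the categorical cross-entropy loss the model minimizes compares the predicted probability vector against the one-hot encoded true manufacturer; a minimal sketch:

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred):
    # y_true: one-hot labels; y_pred: predicted probabilities (each row sums to 1)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y_true * np.log(y_pred + eps), axis=1)

y_true = np.array([[0.0, 1.0, 0.0]])  # true class is index 1
y_pred = np.array([[0.1, 0.8, 0.1]])  # model assigns it probability 0.8
print(categorical_crossentropy(y_true, y_pred)[0])  # -log(0.8) ≈ 0.223
```

The loss is small when the model puts high probability on the correct class and grows quickly as that probability shrinks.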
Show the code
# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
Epoch 1/10 - accuracy: 0.2108 - loss: 3.1745 - val_accuracy: 0.4680 - val_loss: 1.8295
Epoch 2/10 - accuracy: 0.4508 - loss: 1.8474 - val_accuracy: 0.5368 - val_loss: 1.5034
Epoch 3/10 - accuracy: 0.4944 - loss: 1.6042 - val_accuracy: 0.5684 - val_loss: 1.3518
Epoch 4/10 - accuracy: 0.5361 - loss: 1.4596 - val_accuracy: 0.5816 - val_loss: 1.2455
Epoch 5/10 - accuracy: 0.5568 - loss: 1.3647 - val_accuracy: 0.6057 - val_loss: 1.1729
Epoch 6/10 - accuracy: 0.5735 - loss: 1.2885 - val_accuracy: 0.6130 - val_loss: 1.1226
Epoch 7/10 - accuracy: 0.5843 - loss: 1.2358 - val_accuracy: 0.6313 - val_loss: 1.0703
Epoch 8/10 - accuracy: 0.5979 - loss: 1.1826 - val_accuracy: 0.6371 - val_loss: 1.0360
Epoch 9/10 - accuracy: 0.6023 - loss: 1.1477 - val_accuracy: 0.6406 - val_loss: 1.0006
Epoch 10/10 - accuracy: 0.6115 - loss: 1.1095 - val_accuracy: 0.6482 - val_loss: 0.9824
After training, the model’s performance is evaluated on the testing set. This evaluation measures how accurately the model can predict car manufacturers for new data it hasn’t seen before.
Show the code
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
264/264 - accuracy: 0.6346 - loss: 1.0085
Show the code
print(f'Test accuracy: {accuracy}')
Test accuracy: 0.6370738744735718
Show the code
import numpy as np

# Make predictions
predictions = np.argmax(model.predict(X_test), axis=1)
Show the code
# Print predictions
print(predictions)
[114 37 74 ... 36 19 19]
Once the model is trained and evaluated, it can be used to make predictions. Given a new set of car features, the model can predict which manufacturer produced the car.
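Since the targets were one-hot encoded with pd.get_dummies, the integer predictions above can be mapped back to manufacturer names through the dummy column order; a sketch on toy labels (the manufacturer names here are illustrative):

```python
import numpy as np
import pandas as pd

# Toy target: pd.get_dummies orders its columns alphabetically by category
y = pd.Series(["Toyota", "BMW", "Audi", "Toyota"])
class_names = pd.get_dummies(y).columns.to_numpy()  # ['Audi', 'BMW', 'Toyota']

# Pretend these integer indices came from np.argmax(model.predict(...), axis=1)
predictions = np.array([2, 0, 1])
print(class_names[predictions])  # ['Toyota' 'Audi' 'BMW']
```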
Below you can see how the accuracy of the model evolves over the epochs. The training accuracy increases as the model learns from the training data, while the validation accuracy shows how well the model generalizes to new data.
Show the code
import matplotlib.pyplot as plt

# Plot the accuracy and loss
fig, axs = plt.subplots(2, 1, figsize=(10, 10))

# Plot training & validation accuracy values
axs[0].plot(history.history['accuracy'])
axs[0].plot(history.history['val_accuracy'])
axs[0].set_title('Model accuracy')
axs[0].set_ylabel('Accuracy')
axs[0].set_xlabel('Epoch')
axs[0].legend(['Train', 'Validation'], loc='upper left')

# Plot training & validation loss values
axs[1].plot(history.history['loss'])
axs[1].plot(history.history['val_loss'])
axs[1].set_title('Model loss')
axs[1].set_ylabel('Loss')
axs[1].set_xlabel('Epoch')
axs[1].legend(['Train', 'Validation'], loc='upper left')

plt.tight_layout()
plt.show()
Show the code
source(here::here("scripts","setup.R"))
library(data.table)
data_cleaned <- fread(here::here("data", "data_cleaned.csv"))
In order to see the links between the features, we can use a dimension reduction technique such as Principal Component Analysis, which relates features according to their similarities across instances and combines them into fewer dimensions.
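The core idea can be sketched with a toy example in Python using scikit-learn (the analysis itself is carried out in R below): when two features are strongly correlated, PCA collapses their shared variation onto a single component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
x = rng.normal(size=200)
# Three features: the first two are strongly correlated, the third is independent
data = np.column_stack([x, x + 0.1 * rng.normal(size=200), rng.normal(size=200)])

pca = PCA().fit(data)
# The shared direction of the two correlated features dominates the first
# component, which alone explains more than half of the total variance
print(pca.explained_variance_ratio_)
```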
6 Principal Component Analysis
6.1 Data Standardization
Show the code
data_prepared <- data_cleaned %>%
  mutate(across(where(is.character), as.factor)) %>%
  mutate(across(where(is.factor), as.numeric)) %>%
  scale() # Standardizes numeric data, including converted factors
6.2 Heatmap
Show the code
cor_matrix <- cor(data_prepared) # Calculate correlation matrix (Pearson by default)

# Melt the correlation matrix for ggplot2
melted_cor_matrix <- melt(cor_matrix)

# Heatmap with all correlation coefficients displayed
ggplot(melted_cor_matrix, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") + # Add white lines to distinguish the tiles
  geom_text(aes(label = sprintf("%.2f", value)), color = "black", size = 3.5) + # Always display labels
  scale_fill_gradient2(low = "lightblue", high = "darkblue", mid = "blue", midpoint = 0, limit = c(-1,1),
                       name = "Pearson\nCorrelation") + # cor() computes Pearson by default, so the legend is labeled accordingly
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5), # Center the title
        plot.title.position = "plot") +
  labs(x = 'Variables', y = 'Variables',
       title = 'Correlations Heatmap of Variables')
We used this heatmap to check the correlations between the variables. As we can see, a few variables are strongly correlated, but most are not, whether positively or negatively. Let’s now look into the links between the features using a biplot, which combines the observations as well as the features.
6.3 Biplot
Show the code
pca_results <- PCA(data_prepared, graph = FALSE)
summary(pca_results)
Call:
PCA(X = data_prepared, graph = FALSE)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
Variance 4.616 3.735 2.126 1.292 1.019 0.988 0.856
% of var. 25.644 20.748 11.809 7.177 5.660 5.490 4.753
Cumulative % of var. 25.644 46.392 58.201 65.378 71.038 76.527 81.280
Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
Variance 0.827 0.765 0.549 0.510 0.349 0.193 0.137
% of var. 4.594 4.250 3.047 2.834 1.941 1.071 0.759
Cumulative % of var. 85.875 90.124 93.172 96.005 97.946 99.017 99.777
Dim.15 Dim.16 Dim.17 Dim.18
Variance 0.027 0.008 0.003 0.002
% of var. 0.150 0.045 0.018 0.011
Cumulative % of var. 99.926 99.971 99.989 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr
1 | 3.334 | -1.087 0.001 0.106 | 0.065 0.000
2 | 3.387 | -0.907 0.000 0.072 | -0.079 0.000
3 | 3.522 | -1.153 0.001 0.107 | -0.031 0.000
4 | 2.761 | -1.068 0.001 0.150 | 0.012 0.000
5 | 2.713 | -0.991 0.001 0.133 | -0.022 0.000
6 | 2.713 | -0.991 0.001 0.133 | -0.022 0.000
7 | 2.847 | -1.039 0.001 0.133 | -0.066 0.000
8 | 2.876 | -1.151 0.001 0.160 | -0.019 0.000
9 | 4.850 | -1.810 0.002 0.139 | 0.346 0.000
10 | 3.313 | -1.686 0.001 0.259 | 0.231 0.000
cos2 Dim.3 ctr cos2
1 0.000 | 0.018 0.000 0.000 |
2 0.001 | 2.589 0.007 0.584 |
3 0.000 | 2.603 0.008 0.546 |
4 0.000 | 0.658 0.000 0.057 |
5 0.000 | 0.609 0.000 0.050 |
6 0.000 | 0.609 0.000 0.050 |
7 0.001 | 0.490 0.000 0.030 |
8 0.000 | 0.546 0.000 0.036 |
9 0.005 | -0.005 0.000 0.000 |
10 0.005 | 1.551 0.003 0.219 |
Variables (the 10 first)
Dim.1 ctr cos2 Dim.2 ctr cos2
make | 0.106 0.242 0.011 | -0.092 0.229 0.009 |
model_year | 0.347 2.605 0.120 | 0.032 0.028 0.001 |
vehicle_class | -0.141 0.428 0.020 | 0.051 0.071 0.003 |
drive | -0.038 0.031 0.001 | 0.003 0.000 0.000 |
engine_cylinders | 0.077 0.130 0.006 | -0.073 0.141 0.005 |
engine_displacement | 0.125 0.338 0.016 | -0.104 0.292 0.011 |
transmission | -0.498 5.376 0.248 | 0.069 0.128 0.005 |
fuel_type_1 | -0.419 3.802 0.175 | 0.223 1.328 0.050 |
city_mpg_fuel_type_1 | 0.837 15.173 0.700 | -0.308 2.532 0.095 |
highway_mpg_fuel_type_1 | 0.800 13.860 0.640 | -0.313 2.630 0.098 |
Dim.3 ctr cos2
make -0.472 10.475 0.223 |
model_year -0.120 0.679 0.014 |
vehicle_class 0.474 10.568 0.225 |
drive 0.158 1.172 0.025 |
engine_cylinders 0.755 26.797 0.570 |
engine_displacement 0.851 34.040 0.724 |
transmission -0.018 0.016 0.000 |
fuel_type_1 -0.170 1.358 0.029 |
city_mpg_fuel_type_1 -0.242 2.753 0.059 |
highway_mpg_fuel_type_1 -0.351 5.790 0.123 |
Show the code
fviz_pca_biplot(pca_results,
geom.ind = "point",
geom.var = c("arrow", "text"),
col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
The biplot shows several pieces of information. First, the two dimensions explain almost 50% of the total variance of the data. Each point represents an observation, and its color represents the quality of its representation: looking at the cos2 gradient, the redder the dot, the better the representation. The arrows (or vectors) represent the features. Vectors pointing in similar directions indicate positive correlations, whereas vectors pointing in opposite directions indicate negative correlations; orthogonal vectors indicate uncorrelated variables.
Taking all of this into account, we can interpret the graph as follows: the variables linking the mpg and the range for a given fuel type (e.g. fuel type 1) point in the same direction and all seem positively correlated with each other, while being uncorrelated with the same characteristics of the other fuel type (e.g. fuel type 2). Also, mpg and range seem negatively correlated with their own fuel type variable. Moreover, fuel type 1 is uncorrelated with fuel type 2, which makes sense.
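The cos2 values used to color the biplot can be recomputed by hand: for standardized data, a variable's coordinate on a component is its correlation with that component, and cos2 is simply the squared coordinate. A quick check using the city_mpg_fuel_type_1 coordinate from the PCA summary above:

```python
# For standardized data, a variable's PCA coordinate on a component is its
# correlation with that component, so cos2 is just the squared coordinate.
coord_dim1 = 0.837  # city_mpg_fuel_type_1 on Dim.1, taken from the summary above
cos2_dim1 = coord_dim1 ** 2
print(round(cos2_dim1, 2))  # 0.7, matching the 0.700 reported in the table
```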
6.4 Screeplot
Show the code
# Use PCA results to generate the screeplot
fviz_eig(pca_results,
         addlabels = TRUE,
         ylim = c(0, 100),
         barfill = "lightblue",
         barcolor = "black",
         main = "Scree Plot of PCA")
Taking the screeplot into account, 7 dimensions are needed to reach at least 80% of the variance, meaning the features might be relatively independent. This is already visible in the biplot above, as most arrows near the center are short and their cos2 values are low, meaning those features may be better represented by dimensions beyond the first two. Following the biplot, we will try a clustering in order to potentially see clusters of similar observations.
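This count can be checked directly from the eigenvalue table: summing the per-dimension variance percentages until the cumulative total passes 80% indeed requires 7 dimensions.

```python
import numpy as np

# Percent of variance per dimension, copied from the eigenvalue table above
pct_var = [25.644, 20.748, 11.809, 7.177, 5.660, 5.490, 4.753,
           4.594, 4.250, 3.047, 2.834, 1.941, 1.071, 0.759]
cumulative = np.cumsum(pct_var)
n_dims = int(np.argmax(cumulative >= 80)) + 1
print(n_dims)  # 7 dimensions needed to pass 80% (cumulative ~81.3%)
```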
7 Clustering
7.1 Clusters
Show the code
# Extract PCA coordinates for clustering
pca_coords <- pca_results$ind$coord

# Using the Elbow Method to determine the number of clusters
set.seed(123)
wss <- sapply(1:15, function(k) {
  kmeans(pca_coords, centers = k, nstart = 25)$tot.withinss
})

# Elbow Plot
plot(1:15, wss, type = "b", pch = 20, frame = FALSE,
     xlab = "Number of clusters K", ylab = "Total within-clusters sum of squares")
Show the code
# Optimal number of clusters according to the Elbow Method
k <- 4

# K-means
set.seed(123)
km_result <- kmeans(pca_coords, centers = k, nstart = 25)

# Add cluster results and observation names to the PCA coordinates data frame
pca_coords_df <- as.data.frame(pca_coords)
pca_coords_df$cluster <- factor(km_result$cluster)
pca_coords_df$name <- rownames(data_cleaned)

# Visualize the clusters using PCA and label points with observation names
ggplot(pca_coords_df, aes(x = Dim.1, y = Dim.2, color = cluster, label = name)) +
  geom_point() +
  geom_text(check_overlap = TRUE, vjust = 1.5, size = 3) +
  labs(title = "Cluster Plot", x = "Dim1 (25.6%)", y = "Dim2 (20.7%)") + # percentages from the PCA eigenvalues
  scale_color_manual(values = c("lightblue", "lightpink", "lightgreen", "lightgrey")) +
  theme_minimal() +
  theme(legend.position = "right")
7.2 Biplot & Clusters
Show the code
# Calculate Cluster Centers
cluster_centers <- aggregate(pca_coords_df[, 1:2], by = list(cluster = pca_coords_df$cluster), FUN = mean)

# Biplot with clusters
fviz_pca_biplot(pca_results,
                geom.ind = "point",
                geom.var = c("arrow", "text"),
                col.ind = factor(km_result$cluster),
                palette = c("lightblue", "lightpink", "lightgreen", "lightgrey"),
                addEllipses = TRUE,
                ellipse.level = 0.95,
                repel = TRUE,
                legend.title = "Cluster") +
  geom_text(data = cluster_centers, aes(x = Dim.1, y = Dim.2, label = paste("Cluster", cluster)),
            color = "black", size = 5, vjust = -1) +
  labs(title = "PCA - Biplot",
       x = paste("Dim1 (", round(pca_results$eig[1,2], 1), "%)", sep = ""),
       y = paste("Dim2 (", round(pca_results$eig[2,2], 1), "%)", sep = ""))
Using a clustering method allows us to see groups of observations that share similarities. Here, the elbow indicates that the optimal number of clusters should be 4. When comparing it with the biplot (last graph), we clearly notice that cluster 1 refers to the characteristics (range and mpg) for fuel type 1, and the same holds for fuel type 2. Clusters 2 and 3 are packed towards the center and it is harder to see which features they relate to, which is explainable by the fact that we need around 7 dimensions to reach 80% of the total variance. We can suppose that cluster 2 refers to the vehicle class and transmission, whereas cluster 3 focuses on engines and cylinders.